
Background Problem¶
The pharmaceutical industry faces significant challenges in achieving accurate demand forecasting, which is crucial for effective inventory management, cost control, and ensuring timely product availability. While pharmacies and distributors heavily utilize point-of-sale (POS) data to monitor pharmaceutical product movement, issues such as stockouts, overstocking, and the accumulation of expired inventory remain prevalent. These problems often stem from the complex and volatile nature of pharmaceutical demand, which is influenced by a multitude of factors including seasonal disease patterns, public health campaigns, regulatory changes, and sudden market disruptions like epidemics.
Traditional forecasting methods, such as ARIMA, have been employed to address these challenges, but their effectiveness can be limited by the non-stationarity of sales data and the need for careful parameter tuning. Moreover, these methods may struggle to adequately capture the impact of promotional activities or other external factors that significantly influence demand. As Abolghasemi et al. (2021) highlight, the choice between using POS data and order data for forecasting is not straightforward and depends on various factors, including the characteristics of the time series itself. Factors such as the mean, variance, non-linearity, and entropy of POS data, as well as the presence of promotions, can significantly affect forecasting accuracy.
Therefore, there is a need to explore and evaluate alternative forecasting approaches, such as Facebook Prophet, which is designed to handle data with strong seasonality and irregular events, and to systematically investigate the impact of time series characteristics on the performance of different forecasting models in the pharmaceutical context. A more nuanced understanding of these factors can lead to improved forecasting accuracy and better-informed inventory and supply chain decisions.
Objectives¶
This study aims to:
- Analyze POS data of pharmaceutical product sales to identify temporal patterns and seasonal trends.
- Implement and evaluate ARIMA and Prophet models for forecasting pharmaceutical sales.
- Compare the forecasting performance of both models using appropriate evaluation metrics (e.g., RMSE, MAE, MAPE).
- Identify practical advantages and limitations of ARIMA and Prophet in forecasting pharmaceutical sales data.
- Provide actionable insights for inventory and supply chain decision-making based on model forecasts.
Dataset Description¶
The dataset used in this study originates from a comprehensive collection of transactional records, initially comprising approximately 600,000 entries spanning a six-year period from 2014 to 2019. These records were extracted from a Point-of-Sale (POS) system implemented in an individual pharmacy and include essential information such as the exact date and time of each transaction, the brand name of the pharmaceutical product sold, and the quantity dispensed.
From the broader dataset, a subset of 57 pharmaceutical drugs was selected for focused analysis. These drugs were grouped and categorized according to the Anatomical Therapeutic Chemical (ATC) Classification System, covering the following therapeutic classes:
- M01AB – Non-steroidal anti-inflammatory and antirheumatic products (acetic acid derivatives and related compounds)
- M01AE – Non-steroidal anti-inflammatory and antirheumatic products (propionic acid derivatives)
- N02BA – Other analgesics and antipyretics (salicylic acid and derivatives)
- N02BE/B – Other analgesics and antipyretics (pyrazolones and anilides)
- N05B – Psycholeptics (anxiolytic drugs)
- N05C – Psycholeptics (hypnotics and sedatives)
- R03 – Medications for obstructive airway diseases
- R06 – Systemic antihistamines
To facilitate time series analysis, the sales data were aggregated into multiple temporal granularities, including hourly, daily, weekly, and monthly levels. Prior to analysis, the dataset underwent a preprocessing phase that addressed data quality issues. This included the detection and treatment of outliers as well as the imputation of missing values, ensuring the dataset's suitability for robust modeling and forecasting.
Methodology¶
Import Libraries¶
import datetime
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
import plotly.graph_objects as go
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
from sklearn.metrics import mean_absolute_error
from prophet import Prophet
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox
import warnings
warnings.filterwarnings("ignore")
# Format the date as a string
formatted_date = datetime.date.today().strftime("%B %d, %Y")
text_block = f"""
This notebook was executed on: {formatted_date}.
Any results or analysis reflect the data and conditions as of this date.
"""
print(text_block)
This notebook was executed on: May 07, 2025. Any results or analysis reflect the data and conditions as of this date.
Load Dataset¶
We try to load the dataset by hourly, daily, weekly, and monthly approaches.
hourly_df = pd.read_csv("pharma-sales-data/saleshourly.csv")
hourly_df.head()
| datum | M01AB | M01AE | N02BA | N02BE | N05B | N05C | R03 | R06 | Year | Month | Hour | Weekday Name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/2/2014 8:00 | 0.0 | 0.67 | 0.4 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2014 | 1 | 8 | Thursday |
| 1 | 1/2/2014 9:00 | 0.0 | 0.00 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 2014 | 1 | 9 | Thursday |
| 2 | 1/2/2014 10:00 | 0.0 | 0.00 | 0.0 | 3.0 | 2.0 | 0.0 | 0.0 | 0.0 | 2014 | 1 | 10 | Thursday |
| 3 | 1/2/2014 11:00 | 0.0 | 0.00 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2014 | 1 | 11 | Thursday |
| 4 | 1/2/2014 12:00 | 0.0 | 2.00 | 0.0 | 5.0 | 2.0 | 0.0 | 0.0 | 0.0 | 2014 | 1 | 12 | Thursday |
hourly_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50532 entries, 0 to 50531 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 datum 50532 non-null object 1 M01AB 50532 non-null float64 2 M01AE 50532 non-null float64 3 N02BA 50532 non-null float64 4 N02BE 50532 non-null float64 5 N05B 50532 non-null float64 6 N05C 50532 non-null float64 7 R03 50532 non-null float64 8 R06 50532 non-null float64 9 Year 50532 non-null int64 10 Month 50532 non-null int64 11 Hour 50532 non-null int64 12 Weekday Name 50532 non-null object dtypes: float64(8), int64(3), object(2) memory usage: 5.0+ MB
daily_df = pd.read_csv("pharma-sales-data/salesdaily.csv")
daily_df.head()
| datum | M01AB | M01AE | N02BA | N02BE | N05B | N05C | R03 | R06 | Year | Month | Hour | Weekday Name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/2/2014 | 0.0 | 3.67 | 3.4 | 32.40 | 7.0 | 0.0 | 0.0 | 2.0 | 2014 | 1 | 248 | Thursday |
| 1 | 1/3/2014 | 8.0 | 4.00 | 4.4 | 50.60 | 16.0 | 0.0 | 20.0 | 4.0 | 2014 | 1 | 276 | Friday |
| 2 | 1/4/2014 | 2.0 | 1.00 | 6.5 | 61.85 | 10.0 | 0.0 | 9.0 | 1.0 | 2014 | 1 | 276 | Saturday |
| 3 | 1/5/2014 | 4.0 | 3.00 | 7.0 | 41.10 | 8.0 | 0.0 | 3.0 | 0.0 | 2014 | 1 | 276 | Sunday |
| 4 | 1/6/2014 | 5.0 | 1.00 | 4.5 | 21.70 | 16.0 | 2.0 | 6.0 | 2.0 | 2014 | 1 | 276 | Monday |
weekly_df = pd.read_csv("pharma-sales-data/salesweekly.csv")
weekly_df.head()
| datum | M01AB | M01AE | N02BA | N02BE | N05B | N05C | R03 | R06 | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/5/2014 | 14.00 | 11.67 | 21.3 | 185.95 | 41.0 | 0.0 | 32.0 | 7.0 |
| 1 | 1/12/2014 | 29.33 | 12.68 | 37.9 | 190.70 | 88.0 | 5.0 | 21.0 | 7.2 |
| 2 | 1/19/2014 | 30.67 | 26.34 | 45.9 | 218.40 | 80.0 | 8.0 | 29.0 | 12.0 |
| 3 | 1/26/2014 | 34.00 | 32.37 | 31.5 | 179.60 | 80.0 | 8.0 | 23.0 | 10.0 |
| 4 | 2/2/2014 | 31.02 | 23.35 | 20.7 | 159.88 | 84.0 | 12.0 | 29.0 | 12.0 |
Exploratory Data Analysis¶
Comparison Drug Products Over Time¶
In this section, we will look at the comparison of total sales over time, based on the time frame to be observed, be it per hour, per day, per week, to per month.
# Convert datatype of datum
hourly_df['datum'] = pd.to_datetime(hourly_df['datum'])
daily_df['datum'] = pd.to_datetime(daily_df['datum'])
weekly_df['datum'] = pd.to_datetime(weekly_df['datum'])
hourly_df.set_index('datum', inplace=True)
# Create figure
fig_hourly = go.Figure()
# Add a line trace for each drug category
categories = ['M01AB', 'M01AE', 'N02BA', 'N02BE', 'N05B', 'N05C', 'R03', 'R06']
for col in categories:
fig_hourly.add_trace(go.Scatter(
x=hourly_df.index,
y=hourly_df[col],
mode='lines',
name=col
))
# Update layout
fig_hourly.update_layout(
title='Interactive Hourly Drug Sales by Category',
xaxis_title='Datetime (Hour)',
yaxis_title='Sales',
hovermode='x unified',
template='plotly_white',
width=1200,
height=600
)
# Show plot
fig_hourly.show()